Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add new --no-use-system-time flag to use a deterministic timestamp in built PEX #722

Merged
merged 32 commits into from
May 8, 2019

Conversation

Eric-Arellano
Copy link
Contributor

@Eric-Arellano Eric-Arellano commented May 3, 2019

Problem

Part of #716 to make built PEXes be reproducible.

One of the major sources of unreproducibility is timestamps embedded into the zipfile. Python's zipfile.py uses the current system time when creating a new zipfile, per https://github.com/python/cpython/blob/45e92fc02d57a7219f4f4922179929f19b3d1c28/Lib/zipfile.py#L509. So, the Pex cannot be reproducible if built after the original run.

Solution

Introduce a new flag --use-system-time / --no-use-system-time that will toggle whether we use a deterministic timestamp or keep the default behavior of using the current system time. We will default to --use-system-time for now to keep prior behavior, and then in a future release like 1.70 will switch over to default to reproducible builds.

If --no-use-system-time is specified, we use midnight at January 1, 1980, which is the start of the MS-DOS time used by Zipfiles. We then use this value to construct a ZipInfo with the desired date.

Not respecting SOURCE_DATE_EPOCH

We do not respect the standard env var $SOURCE_DATE_EPOCH often used by tools for deterministic timestamps, as doing so risks reducing the reproducibility of built pexes across different platforms, e.g. if two platforms set the env var differently. Usually, this is supposed to be set for the sake of security to check that a shipped binary was not tampered with, but that is not our primary use case, so we do not respect it both for simplicity of our codebase and to avoid this potential reduction in reproducibility. Refer to https://reproducible-builds.org/docs/source-date-epoch/ for more information.

Result

4 / 7 of the acceptance tests for reproducibility now pass consistently!

When setting CI to always use a deterministic timestamp, CI was green so things will work as expected when we change the default to deterministic timestamps.

@Eric-Arellano Eric-Arellano changed the title WIP: Add new --no-use-system-time flag to use a deterministic timestamp in built PEXe Add new --no-use-system-time flag to use a deterministic timestamp in built PEX May 3, 2019
@@ -302,6 +302,18 @@ def configure_clp_pex_options(parser):
'likely will not be reproducible, meaning that if you were to run `./pex -o` with the '
'same inputs then the new pex would not be byte-for-byte identical to the original.')

group.add_option(
'--use-system-time', '--no-use-system-time',
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Open to suggestions for a better name.

The main use case I want to optimize for is when deterministic time will be the default. For example, atm --use-deterministic-time would probably be a good name, but once we make the switch then --no-use-deterministic-time wouldn't be very descriptive for what the behavior would end up being.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In pantsbuild/pants#6094 we made a flag -jar-creation-time which, if not set, defaulted to the current time. Not sure whether this is a good or bad default for pex, but it's some prior art :)

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@illicitonion : I think that we agreed on the other issue that we want this enabled by default, so the default would need to be stable.

@Eric-Arellano : This name looks good I think?

pex/common.py Outdated Show resolved Hide resolved
pex/common.py Outdated Show resolved Hide resolved
pex/common.py Outdated Show resolved Hide resolved
pex/pex_builder.py Show resolved Hide resolved
We use the result many times, and it is not expected to ever change between a pex run, so this is better set to be a property.

We eagerly evaluate the constant for simplicity of implementation. Because the calculation is so cheap, it is not necessary to lazily evaluate.
If it's a constant, theen it becomes very difficult to test that $SOURCE_DATE_EPOCH works because constants get evaluated at import time, and at import time os.environ does not have the env var set. It will ignore the test setting the value, because the constant never gets re-evaluated.
We have no compelling use case for it and it complicates our code.
Copy link

@stuhood stuhood left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks! FTR: Daniel is out until our Tuesday night, so you might wait for a John review and then go ahead with landing.

pex/common.py Outdated Show resolved Hide resolved
pex/common.py Outdated Show resolved Hide resolved
pex/common.py Outdated Show resolved Hide resolved
pex/common.py Outdated Show resolved Hide resolved
@jsirois
Copy link
Member

jsirois commented May 7, 2019

RE: the SOURCE_DATE_EPOCH - this is just a flag in spirit that allows external tools like Pants to control the timestamp. Presumably Pants would pass a single value no matter what platform it was run on. As such I don't really follow the objection from Pants point of view except for the fact that:

  1. Its not a PEX_ env var so its sneaky / hard to discover.
  2. It does not have a paired pex flag to allow passing the value arguably more explicitly.

All said though I'm happy enough to start with a fixed hardcoded value for simplicity sake. We could always add support later.

@jsirois jsirois mentioned this pull request May 7, 2019
11 tasks
We don't need to explain this here because readers of the code are unlikely to anticipate it being an issue, so are likely to get distracted more than anything. Instead, this PR description is updated to make the explanation if anyone is wondering "wait why didn't they do this?"

Also make the MS-DOS explanation more concise.
Makes the code more consistent and simpler.
@Eric-Arellano
Copy link
Contributor Author

Thanks @jsirois for the thoughtful review. Great suggestions.

I took them and added a function zip_info_from_file() to PermPreservingZipFile. Notably, I do not have the function always use a deterministic timestamp, but only parametrize it so that we can support both cases of system time and deterministic.

This allowed me to refactor zip() to always use writestr(), which imo leads to much more consistent and simpler code.

pex/common.py Outdated Show resolved Hide resolved
We have no need for the more general datetime format yet. Don't make things more general until necessary!
My bad for not testing before pushing.
pex/common.py Outdated Show resolved Hide resolved
pex/common.py Outdated Show resolved Hide resolved
@Eric-Arellano Eric-Arellano merged commit 60fd513 into pex-tool:master May 8, 2019
@Eric-Arellano Eric-Arellano deleted the timestamps branch May 8, 2019 01:20
@Eric-Arellano Eric-Arellano mentioned this pull request Nov 14, 2019
2 tasks
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants